DS1000: by models


p-values for model pairs

The null hypothesis is that models A and B each win with probability 1/2 on every task where their results differ; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
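The test described above is the exact two-sided sign test. A minimal sketch in Python (the function name and example counts are illustrative, not from the leaderboard):

```python
from math import comb

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value: under the null, each non-tied task
    is a fair coin flip between model A and model B (ties are dropped)."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    k = max(a_wins, b_wins)
    # P(the leading model wins >= k of n tasks) under Binomial(n, 1/2),
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Example: A wins 60 non-tied tasks, B wins 40; all ties are ignored.
print(round(sign_test_p_value(60, 40), 4))  # → 0.0569
```

Note that only the non-tied tasks enter the test, which is why two models can differ noticeably in raw accuracy yet fail to reach significance when they agree on most tasks.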

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source data used to compute the p-values. Any model pair to the right of the parabola differs statistically significantly at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
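The parabola comes from the normal approximation: significance at level α requires roughly (#A_win - #B_win)² ≥ z² · (#A_win + #B_win), so the boundary in the (difference, total) plane is a parabola. A small sketch using the exact test to find the smallest significant gap for a given number of non-tied tasks (function names are illustrative):

```python
from math import comb

def p_value(a: int, b: int) -> float:
    """Exact two-sided sign-test p-value for a vs. b wins."""
    n = a + b
    k = max(a, b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def min_gap_for_significance(n: int, alpha: float = 0.05):
    """Smallest |#A_win - #B_win| that is significant at level alpha
    when #A_win + #B_win = n."""
    for gap in range(n % 2, n + 1, 2):  # gap must have the same parity as n
        a = (n + gap) // 2
        if p_value(a, n - a) < alpha:
            return gap
    return None

# The required gap grows roughly like 1.96 * sqrt(n):
for n in (25, 100, 400):
    print(n, min_gap_for_significance(n))
```

This is why pairs with few non-tied tasks cannot be significant at a small accuracy difference: the required gap shrinks only like the square root of the total.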

Results table by model

We show three methods currently used to evaluate code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average is set to 1000.
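The two aggregate scores can be sketched from a pairwise win matrix. Below is a minimal illustration with a hypothetical 3-model `wins` matrix (not the leaderboard's data): average win-rate is a direct ratio, while the Bradley-Terry strengths are fit with the classic minorization-maximization update and then mapped to an Elo-like scale (400/ln 10 points per unit of log-strength, anchored so the average is 1000; the page instead anchors GPT-3.5 at 1000 when available).

```python
import math

# Hypothetical pairwise results: wins[i][j] = tasks model i beats model j on.
wins = [
    [0, 60, 70],
    [40, 0, 55],
    [30, 45, 0],
]
n_models = len(wins)

# Average win-rate over all other models.
win_rate = [
    sum(wins[i][j] for j in range(n_models) if j != i)
    / sum(wins[i][j] + wins[j][i] for j in range(n_models) if j != i)
    for i in range(n_models)
]

# Bradley-Terry strengths via the MM update:
#   p_i <- (total wins of i) / sum_j (games vs j) / (p_i + p_j)
p = [1.0] * n_models
for _ in range(200):
    p = [
        sum(wins[i][j] for j in range(n_models) if j != i)
        / sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
              for j in range(n_models) if j != i)
        for i in range(n_models)
    ]
    gm = math.exp(sum(math.log(x) for x in p) / n_models)
    p = [x / gm for x in p]  # fix the scale: geometric mean 1

# Map log-strengths to an Elo-like scale averaging 1000.
elo = [400 / math.log(10) * math.log(x) + 1000 for x in p]
print([round(r, 3) for r in win_rate])
print([round(e, 1) for e in elo])
```

Both scores rank the models the same way here; in general they can disagree slightly, since Bradley-Terry weighs each pairwise margin rather than averaging per-opponent rates.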

model  pass@1  win-rate  Elo
gpt-4-turbo-2024-04-09 54.0% 88.6% 1197.4
gpt-4-0613 51.0% 85.7% 1149.5
microsoft-wavecoder-ultra-6.7b 46.0% 81.8% 1093.1
deepseek-ai-deepseek-coder-6.7b-instruct 45.8% 82.2% 1097.7
deepseek-ai-deepseek-coder-7b-instruct-v1.5 42.6% 78.1% 1054.9
m-a-p-OpenCodeInterpreter-DS-6.7B 42.0% 77.2% 1045.5
m-a-p-OpenCodeInterpreter-CL-7B 39.5% 72.7% 1002.1
gpt-3.5-turbo-0125 39.4% 72.2% 1003.2
m-a-p-OpenCodeInterpreter-SC2-7B 38.9% 69.8% 973.2
m-a-p-OpenCodeInterpreter-SC2-3B 38.6% 71.3% 988.6
gpt-3.5-turbo-0613 38.6% 71.8% 1000.0
codex002 38.6% 74.0% 1016.9
google-codegemma-7b 34.8% 67.9% 968.4
deepseek-ai-deepseek-coder-7b-base-v1.5 34.2% 66.8% 956.7
ibm-granite-granite-8b-code-base 33.8% 65.6% 945.4
microsoft-wavecoder-ds-6.7b 32.8% 63.1% 925.7
microsoft-Phi-3-mini-4k-instruct 32.1% 61.8% 919.1
meta-llama-Meta-Llama-3-8B 31.5% 60.8% 913.5
bigcode-starcoder2-7b 31.4% 61.0% 913.2
microsoft-Phi-3-mini-128k-instruct 31.3% 60.0% 906.7
microsoft-wavecoder-pro-6.7b 31.2% 59.8% 901.1
deepseek-ai-deepseek-coder-6.7b-base 31.1% 60.6% 905.5
ibm-granite-granite-8b-code-instruct 31.0% 59.8% 901.1
meta-llama-CodeLlama-13b-Python-hf 30.8% 59.8% 901.9
openchat-openchat-3.5-0106 30.3% 58.3% 888.4
google-codegemma-1.1-7b-it 29.7% 57.0% 882.6
meta-llama-Meta-Llama-3-8B-Instruct 29.3% 56.1% 876.3
deepseek-ai-deepseek-coder-1.3b-instruct 29.3% 56.0% 870.8
Qwen-CodeQwen1.5-7B 27.6% 52.8% 851.7
bigcode-starcoder2-3b 27.3% 52.2% 846.8
google-codegemma-7b-it 26.2% 49.8% 832.0
google-gemma-7b 26.1% 49.5% 828.8
meta-llama-CodeLlama-7b-Python-hf 26.0% 49.3% 827.6
stabilityai-stable-code-3b 25.6% 48.4% 822.5
m-a-p-OpenCodeInterpreter-DS-1.3B 25.0% 47.2% 810.7
stabilityai-stable-code-instruct-3b 24.7% 46.7% 804.6
meta-llama-CodeLlama-7b-Instruct-hf 24.5% 45.8% 805.2
THUDM-codegeex2-6b 24.1% 44.9% 795.5
claude-3-sonnet-20240229 23.2% 43.6% 789.8
meta-llama-CodeLlama-7b-hf 22.9% 41.9% 774.0
ibm-granite-granite-3b-code-base 22.8% 41.6% 770.8
claude-3-opus-20240229 21.6% 40.5% 767.6
microsoft-phi-2 21.5% 38.9% 750.6
ibm-granite-granite-3b-code-instruct 21.0% 37.4% 734.3
gpt-4o-2024-05-13 20.1% 37.9% 751.1
mistralai-Mistral-7B-Instruct-v0.2 20.0% 36.2% 730.9
Qwen-CodeQwen1.5-7B-Chat 19.5% 36.6% 737.5
google-gemma-1.1-7b-it 18.3% 32.5% 699.2
Salesforce-codegen25-7b-instruct_P 18.3% 32.7% 705.1
deepseek-ai-deepseek-coder-1.3b-base 17.5% 29.0% 671.6
google-codegemma-1.1-2b 16.6% 27.8% 661.9
claude-3-haiku-20240307 16.3% 28.8% 673.8
Salesforce-codegen25-7b-mono_P 15.6% 26.3% 651.7
google-codegemma-2b 13.3% 20.6% 594.3
google-gemma-7b-it 11.4% 17.1% 551.6
google-gemma-2b 10.3% 14.0% 511.1
microsoft-phi-1 9.1% 11.8% 474.1
google-gemma-1.1-2b-it 8.5% 12.4% 486.0
microsoft-phi-1_5 8.3% 11.7% 472.1
meta-llama-Llama-2-7b-hf 6.9% 9.0% 418.3
meta-llama-Llama-2-7b-chat-hf 6.4% 8.0% 399.1
google-gemma-2b-it 6.0% 7.8% 394.0
smallcloudai-Refact-1_6B-fim 5.7% 7.9% 393.2